Seqana Challenge

Preface

This is a very short report because time was running out. The CSS is also not ideal, but it is what I had ready.

Correlation matrix

Here is a heatmap of the correlation matrix. The block patterns suggest that groups of variables are strongly related, and some are nearly duplicates of each other.

Target distribution

Here is the distribution of the target variable. There is not much to say: it is right-skewed and looks a bit like a lognormal distribution.

    mean: 69.399, std: 67.241
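
A quick way to probe the lognormal guess is to look at the log of the target: if the target were lognormal, its log would be roughly normal, with skewness near 0. The sketch below is only an illustration. It uses synthetic data drawn from a lognormal whose parameters are moment-matched to the mean and std reported above, not the challenge data; `skewness` and `lognormal_params_from_moments` are hypothetical helper names.

```python
import numpy as np

def skewness(x):
    """Sample skewness: the third standardized moment."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return float((centered ** 3).mean() / (centered ** 2).mean() ** 1.5)

def lognormal_params_from_moments(mean, std):
    """Moment-match a lognormal: return (mu, sigma) of the underlying normal."""
    sigma2 = np.log(1.0 + (std / mean) ** 2)
    mu = np.log(mean) - sigma2 / 2.0
    return float(mu), float(np.sqrt(sigma2))

# Parameters implied by the reported mean and std, IF the target were lognormal.
mu, sigma = lognormal_params_from_moments(69.399, 67.241)

# Synthetic stand-in for the real target column (assumption: not the challenge data).
rng = np.random.default_rng(0)
target = rng.lognormal(mean=mu, sigma=sigma, size=5000)

print(f"mu={mu:.3f}, sigma={sigma:.3f}")
print(f"skewness raw: {skewness(target):.2f}, log: {skewness(np.log(target)):.2f}")
```

On the real target, a log-skewness far from 0 would argue against the lognormal guess.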

Correlation with the target

Many features show a significant correlation with the target. I will keep only the most correlated ones for training the model.

This is, however, a very crude method of feature selection: many features could be strongly correlated with each other and therefore add little new information, and features with no linear correlation could still have a non-linear dependency on the target.

Most correlated features:

    feature                     corr
    soil_grids_soc_5_15        0.518
    soil_grids_soc_0_5         0.496
    soil_grids_ocd_5_15        0.491
    soil_grids_nitrogen_0_5    0.480
    soil_olm_soc_b0            0.477
    soil_grids_ocd_0_5         0.464
    soil_olm_soc_b10           0.455
    soil_grids_ocs_0_30        0.449
    LST_Day_1km_09_mean       -0.429
    soil_grids_soc_15_30       0.424
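
The redundancy caveat above could be softened with a small extension of the same idea: rank features by absolute correlation with the target, but skip any feature that is almost collinear with one already kept. This is a minimal sketch and not the selection used in this report; the function name, the `max_mutual_corr` threshold and the toy data are all assumptions. It still only captures linear relations.

```python
import numpy as np
import pandas as pd

def select_by_target_correlation(df, target, top_k=10, max_mutual_corr=0.9):
    """Rank features by |Pearson corr| with the target, skipping near-duplicates."""
    features = df.drop(columns=[target])
    target_corr = features.corrwith(df[target])
    ranked = target_corr.abs().sort_values(ascending=False).index
    feature_corr = features.corr().abs()

    selected = []
    for name in ranked:
        # Skip a feature that is almost collinear with one already kept.
        if all(feature_corr.loc[name, kept] < max_mutual_corr for kept in selected):
            selected.append(name)
        if len(selected) == top_k:
            break
    return target_corr.loc[selected]

# Toy data (hypothetical): "a_copy" duplicates "a", "noise" is unrelated to y.
rng = np.random.default_rng(0)
n = 500
a, b = rng.normal(size=n), rng.normal(size=n)
demo = pd.DataFrame({
    "a": a,
    "a_copy": a + 0.01 * rng.normal(size=n),
    "b": b,
    "noise": rng.normal(size=n),
})
demo["y"] = 2 * a + b + 0.5 * rng.normal(size=n)

chosen = select_by_target_correlation(demo, "y", top_k=2)
print(chosen)
```

With `top_k=2`, only one of `a` and `a_copy` is kept, so the second slot goes to `b` instead of a duplicate.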

Pipeline

I created a very simple pipeline: it just selects the given features, applies a standard scaler and trains the model.

The model is trained on train_df and tested on test_df, which come from a single random split of the data. No cross-validation or fancy stuff.

Three evaluation metrics are also computed on the test set: rmse, mae and r2.
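
For illustration, a pipeline of this shape could be written with scikit-learn roughly as follows. This is a sketch under assumptions, not the code behind this report: the helper names `make_pipeline_for` and `evaluate`, the toy columns and the split parameters are all made up, and only the three steps (select features, standard scaler, model) come from the description above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def make_pipeline_for(features, model):
    """Select `features`, standardize them, then fit `model`."""
    select_and_scale = ColumnTransformer(
        [("scale", StandardScaler(), list(features))],
        remainder="drop",  # every column not listed is discarded
    )
    return Pipeline([("prep", select_and_scale), ("model", model)])

def evaluate(pipeline, train_df, test_df, target):
    """Fit on train_df and return rmse, mae and r2 on test_df."""
    pipeline.fit(train_df, train_df[target])
    pred = pipeline.predict(test_df)
    truth = test_df[target]
    return {
        "rmse": float(np.sqrt(mean_squared_error(truth, pred))),
        "mae": float(mean_absolute_error(truth, pred)),
        "r2": float(r2_score(truth, pred)),
    }

# Toy data (hypothetical): y depends linearly on f1 and f2, "unused" is ignored.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(400, 3)), columns=["f1", "f2", "unused"])
df["y"] = 3 * df["f1"] - 2 * df["f2"] + rng.normal(scale=0.1, size=400)
train_df, test_df = train_test_split(df, test_size=0.25, random_state=0)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
}
results = {}
for name, model in models.items():
    pipe = make_pipeline_for(["f1", "f2"], model)
    results[name] = evaluate(pipe, train_df, test_df, target="y")
    print(name, results[name])
```

Keeping the scaler inside the pipeline means it is fitted on train_df only, so no test-set statistics leak into training.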

Here is the output for two example models, linear regression and random forest:

    Linear Regressor
    rmse: 53.925, mae: 31.834, r2: 0.366

    Random Forest
    rmse: 52.310, mae: 31.278, r2: 0.403

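We can see that the random forest performs slightly better. Both rmses are below the target std (67.241), which matches the positive r2 values: both models do better than always predicting the mean, but the errors are still far from 0.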
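
The comparison between rmse and the target std can be made explicit: r2 = 1 - (rmse / std_test)^2, where std_test is the (population) std of the test targets, so a model that always predicts the test mean has rmse = std_test and r2 = 0. The sketch below plugs in the numbers reported above. It assumes the std of the full target is close to the std on test_df, which is not reported here, so the first figure is only an approximation.

```python
import math

TARGET_STD = 67.241  # std of the full target, as reported above

reported = {
    "Linear Regressor": {"rmse": 53.925, "r2": 0.366},
    "Random Forest": {"rmse": 52.310, "r2": 0.403},
}

def implied_test_std(rmse, r2):
    """Invert r2 = 1 - (rmse / std)**2 to get the std of the test target."""
    return rmse / math.sqrt(1.0 - r2)

implied = {}
for name, scores in reported.items():
    # r2 we would expect if the test std equalled the full-sample std.
    approx_r2 = 1.0 - (scores["rmse"] / TARGET_STD) ** 2
    implied[name] = implied_test_std(scores["rmse"], scores["r2"])
    print(f"{name}: r2 from full-sample std = {approx_r2:.3f}, "
          f"implied test std = {implied[name]:.1f}")
```

Both models imply a test std of about 67.7, close to the full-sample value, so the reported rmse and r2 figures are consistent with each other.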